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Abstract: 

DSPSR is a high-performance, open-source, object-oriented, digital signal processing software library and 
application suite for use in radio pulsar astronomy. Written primarily in CH — h, the library implements 
an extensive range of modular algorithms that can optionally exploit both multiple-core processors and 
general-purpose graphics processing units. After over a decade of research and development, DSPSR 
is now stable and in widespread use in the community. This paper presents a detailed description of 
its functionality, justification of major design decisions, analysis of phase-coherent dispersion removal 
algorithms, and demonstration of performance on some contemporary microprocessor architectures. 

Keywords: methods: data analysis — polarisation — pulsars: general — techniques: polarimetric 



1 Introduction 

The main technical challenges that present themselves 
to pulsar astronomers arise from two opposing forces: 
the drive for greater sensitivity and the adverse effects 
of propagation through the interstellar medium (ISM). 
For a given antenna and observing schedule, greater 
sensitivity may be achieved by increased instrumen- 
tal bandwidth, which comes at the cost of additional 
signal distortion primarily due to plasma dispersion in 
the ISM. A wide variety of pulsar instrumentation has 
been developed to either mitigate or eliminate inter- 
stellar dispersion; these may be broadly classified as 
either post-detection or pre-detection techniques. 

Post-detection methods employ a filterbank, either 
analog or digital, to divide the wideband signal into 
narrow channels. The voltage signal in each frequency 
channel is detected and the inter-channel dispersion 
delays are removed before integrating over frequency. 
Post-detection dispersion removal is fundamentally lim- 
ited by residual intra-channel dispersion smearing and 
reduced temporal resolution, both of which are in- 
versely proportional to the number of filterbank chan- 
nels used. In contrast, pre-detection (or phase-coherent) 
dispersion removal, completely eliminates dispersion 
smearing while retai ning the original time reso lution of 
the observed signal (jHankins fc R ickctt 1975). Phase- 
coherent dedispersion is applied directly to the volt- 
age signal, which must be sampled at the Nyquist rate 
and either recorded or reduced using a high-speed dig- 
ital signal processing system. Owing to the relatively 
high demand on both data storage and computing re- 
sources, the earliest baseband recording and processing 
systems were limited to relatively small bandwidths. 
For example, coherent dedispersion was pioneered us- 
ing a baseband recorder with a modest bandwidth of 
125 kHz, which could record with a 20% duty cycle 
for only 3 min utes before filling the magnetic tape 
(|Hankinslll97l . 

Fortunately, the capacity of baseband recording 
and processing systems closely follows the continual 



growth of affordable, commercial computing and data 
storage technologies, and by the mid-1990s a num- 
ber of baseband recorders were in regular use at ra- 
dio observatories around the world. The Princeton 
Mark IV system, capable of c ontinuously 2-b i t sam- 
pling a 2 x 10 MHz bandpass (|Shraunedll997l : IStairs! 
11998 : IStairs etalj l2000l) , was in use at the Arecibo 
and Jodrell Bank observatories. At Parkes, baseband 
recorders included the Wide Bandwidth Digital Record- 
ing (WBD R) system, capabl e of recording a 2 x 50 MHz 
bandpass (Jenet et al. 1 9971): the S2 VL BI recorder, a 
2x16 MHz system (jWietfeldt et al.lll998l); and the Cal- 
tech Parkes Swinburne Recorder (GPSR: [van Straten et ail 
I2000T ). a 2 x 20 MHz version of the Caltech Baseband 
Recorder (CBR) at Arecibo. Each of the above sys- 
tems recorded the baseband signal to magnetic tape 
for transport and offline analysis on high-performance 
computing resources. Also in use at Green Bank, Ef- 
felsberg, and Arecibo were versions of the coherent 
Berkeley Pulsar Processor (cBPP), which employed 
digital signal processing hardware to perform coherent 
dedispersion in real time. Depending on the pulsar 
dispersion measure, these systems were c apable of cor- 
rectin g a bandpass of up to 2 x 112 MHz (|Backer et al.l 
119971 1. 

At Parkes, CPSR2 led the evolution away from 
recording on magnetic tape by using a modest cluster 
of workstations at the telescope to process 2 x 12 8 MHz 
of bandwidth in quasi- real time (Bailesl l2003l l. The 
Arecibo Signal Processor (ASP), Green Bank Astro- 
nomical Signal Processor (GASP), and Berkeley Or- 
leans Nancay processor (BON) series of instruments 
also process 2 x 128 MHz in quasi-real time, using 8- 
bit samp ling to significantly re duce quantisation dis- 
tortions l|Demorest et al.ll2004l ). The second genera- 
tion of the Dutch Pulsar Machine, PuMa-II, another 
high dynamic range system with 8- bit sampling and 2x 
160 MHz of bandwidth is now in o peration at the West- 
erbor k Synthesis Radio Telescope l|Karuppusamv et al.l 
2008). Each of the above systems use large capacity, 
high speed disks to buffer the incoming baseband sig- 
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nal before it is processed. 

In 2007, the first coherent dedispersion software to 
work completely in memory was pioneered at Parkes 
with the development of the ATNF Parkes Swinburne 
Recorder (APSR). This instrument is capable of real- 
time processing 2x1 GHz of bandwidth with 2 bits/sample, 
or 2 x 256 MHz with 8 bits/sample. APSR uses the 
same digital signal processing software that was de- 
veloped at Swinburne to reduce the baseband data 
recorded by the S2 and CPSR instruments. Later 
named dspseQ, this software evolved through the de- 
velopment of PuMa-II at WSRT, and is now utilised 
to process data from a wide variety of instruments, 
including the Australian Long Baseline Array Data 
Recorder (LBADR) and the Giant Metrewave Radi o 
Telescope (GMRT) software backend (|Rov et alJ201Ch . 
The real-time processing requirements of APSR moti- 
vated the implementation of the shared memory and 
multi-threaded capabilities of DSPSR. 

Around this time, advances in general-purpose com- 
puting on graphics processing units (GPUs) were rapidly 
transforming the state of the art of digital signal pro- 
cessing. The Green Bank Ultimate Pulsar Process- 
ing Instrument (GUPPI) first demonstrated the use of 
GPUs to perform phase-coherent dispersion removal in 
real time (P. Demorest 2009, private communication). 
Following these remarkable developments, DSPSR was 
extended to use the Compute Unified Device Archi- 
tecture (CUDA) developed by NVIDIA®0. This new 
capability was driven by the development of a GPU- 
based digital signal processing system for Parkes us- 
ing the Interconnect Break-out Board (IBOB) pro- 
duced by the Center for Astronomy Signal Processing 
and Electronics Research (CASPER) at Berkeley. The 
CASPER Parkes Swinburne Recorder (CASPSR) per- 
forms real-time phase-coherent dispersion removal on 
2 x 400 MHz using four server-class workstations, each 
equipped with two NVIDIA® Tesla C1060 GPUs. 

Through these stages of evolution and refinement, 
DSPSR has amassed an extensive range of useful fea- 
tures. These are briefly summarised in Section [2] fol- 
lowed by a more detailed description of the algorithms 
developed for radio pulsar astronomy. The perfor- 
mance of the library is demonstrated in Section|3]using 
currently available technology and common observing 
configurations. In Section [4] the future of the soft- 
ware project is discussed with regard to the upcoming 
generation of radio telescopes. 

2 Algorithms and Features 

The DSPSR software processes continuous streams of 
radio pulsar astronomical data, producing integrated 
statistics such the phase-resolved average polarisation 
of the pulsar signal. The salient features of this soft- 
ware are 

• automatic excision of invalid data, such as those lost 
during transfer or corrupted by impulsive interference; 

• correction of di gitisation distortions via d ynamic out- 
put level setting ( Jenet fc Andersonlll998h : 

-"^http: / /dspsr. sourceforge.net 
2 http://www. nvidia.com/cuda 



• pha se-coherent dispersion removal ( Hankin s fc Rickettl 
I1975T ) with optional Jones m atrix convolution f or high 
time-resolution polarimetry (jvan S tratcn 20021); 

• synthetic filterbank formation with concurrent co- 
herent dedispersion and/or Jones matrix convolution; 

• full-polarimetric detection; 

• computation of phase-resolved averages (folding) us- 
ing either a polynomial approximation to the pulsar 
phase model, a constant period, or acceleration search 
parameters; 

• formation of sub-integrations of arbitrary length, in- 
cluding single pulses; 

• simultaneous folding of multiple pulsars, such as glob- 
ular cluster or double pulsars; 

• exploitation of multiple processing cores using par- 
allel computing threads; 

• accelerated computation on graphics processing units; 

• transparent buffering of input data to handle the 
edge effects of certain algorithms, such as the overlap- 
save method of discrete convolution; 

• real-time signal processing directly from a ring buffer 
in shared memory; and 

• time-division multiplexing: discontiguous segments 
of data may be processed as a single stream to min- 
imise initialisation time. 

The following sections describe these features in 
greater detail while outlining a typical pulsar signal 
processing path. 

2.1 Invalid Data Excision 

Any operation may flag sections of data as invalid, such 
that they are ignored by subsequent components in the 
signal processing chain. This design feature is imple- 
mented as an array of weights that is maintained in 
parallel with the blocks of data that represent the as- 
tronomical signal. The weights array is typically used 
to flag segments of data in the time domain, such that 
a single weight applies to all frequency channels and 
polarisations over some epoch. However, the weights 
array can also be used to flag one or more specific fre- 
quency channels as invalid. 

Many of the classes used to convert between n- 
bit and floating point representations of the digitised 
signal monitor the data and flag sections as invalid 
whenever certain statistics (such as the noise power) 
fall outside the acceptable ranges of operation. The 
weights array is used by operations that perform in- 
tegrations (such as the pulse profile folding operation 
described in Section [2U)| to ensure that final results are 
not corrupted by invalid data. For example, invalid 
data may be recorded before the instrumentation is 
properly initialised. As illustrated in Figure [T] strong 
bursts of impulsive interference such as lightning are 
completely excised using this technique. 

2.2 Dynamic Output Level Setting 

Analog-to-digital conversion is a non-linear process that 
introduces both noise and signal distortion. Dynamic 
output level setting aims to correct the quantisation 
distortion by restoring the line ar relationship between 
digitised and undigitised power (| Jenet fc Andersonl 19981 . 
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Figure 1: Automatic impulsive interference ex- 
cision by DSPSR during a lightning storm. A 
noise diode coupled to the receptors of the Parkes 
20 cm Multibcam receiver was modulated by a 
square wave with a 50% duty cycle. This cali- 
brator signal was observed at a centre frequency 
of 1369 MHz using the Parkes Digital Filterbank 
(PDFB3, above) and the ATNF Parkes Swinburne 
Recorder (APSR, below). Both instruments ob- 
served the same 2 x 256 MHz band and the flux 
scale in each plot is uncalibrated. Each of the four 
30-second integrations output by PDFB3 are irre- 
versibly corrupted by broadband impulsive noise. 
However, by discarding only the contaminated seg- 
ments of APSR data (less than 0.05% of the sig- 
nal) the 2-minute integration produced by DSPSR 
is greatly improved. 



hereafter JA98). In DSPSR this correction is performed 
during the conversion from 2-bit to floating point rep- 
resentations of the digitised voltage samples. Referring 
to section 6 of JA98, the data are divided into consecu- 
tive segments of L digitised samples. In each segment, 
the number of low voltage states, M, is counted and 
the undigitised power, a 2 , is estimated by inverting 
equation 45 of JA98, 



A M 

$ = — = erf 
lj 

where the error function, 



erf (a;) = — — 

\/7T 



V^cr 



- y dy, 



(1) 



(2) 



and t is the optimal threshold between low and high 
voltage states. Equation ([1]) is solved for a using the 
Newton-Raphson method, and it is computationally 
prohibitive to perform this iterative calculation for ev- 
ery L-point segment. However, as there are only L + l 
possible values of M, DSPSR utilises a lookup table of 
precomputed output levels, which are stored for only 
an acceptable range of input power estimates. Those 
segments of data for which M falls outside of the ac- 
ceptable range are flagged as invalid (as described in 
more detail the next section). DSPSR also maintains a 
histogram of occurrences of M that is archived for later 
use as a diagnostic tool when assessing the quality of 
the baseband recording. 

2.3 Coherent Dedispersion 

Electromagnetic waves propagating through the inter- 
stellar medium (ISM) experience phase dispersion that 
effects a frequency-dependent group velocity. Describ- 
ing the ISM as a cold, tenuous plasma, the frequency 
response function, 



H(y + v ) = exp 



(3) 



is obta ined and used to deconvolve the observe d radio 
signal (|Hankinslll97ll : lHankins fc Ricketdll975h . Here, 
vo is the centre frequency of the observation and the 
dispersion D is related to the more commonly used dis - 
persion measure DM by l|Manchester fc Tavlorlll972h 

DM(pccm" 3 ) = 2.41 x 10~ 4 D (s MHz 2 ). (4) 

The dispersion measure is equal to the integral of the 
free electron density along the line of sight to the pul- 
sar. Although the constant of proportionality has been 
derived with greater precision ^Backer et al.lll993l ). Equa- 
tion Q is the standard adopted by pulsar astronomers 
and implemented by DSPSR. 

The duration of the phase dispersion impulse re- 
sponse function is called the sweep or smearing time, 
td- It is equivalent to the width of a band- limited 
delta-function after passage through the ISM, and is a 
function of the bandwidth, 5v, and centre frequency, 
vq , such that 



-2 \ 
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where v m i n = vq — Sv/2, and ^ ma x = vo + 8v/2. 

Deconvolution is most efficiently performed in the 
frequency domain, where the observed signal is sim- 
ply multiplied by the inverse of the discrete form of 
Equation Q. This results in cyclical convo lution, as 
described in Chapter 18 of iBrace welll (Il98 6f) an d Sec- 
tion 13.1 of Numerical Recipes (jPress et al.lll992l . here- 
after NR). Figure [2] (cf. Figure 13.1.3 of NR) illus- 
trates that each time sample output by the cyclical 
convolution operation will depend upon the = rti 
points preceding it and the nj — rt~[ points follow- 
ing it, where r is the sampling rate and t\ and tj 
are the dispersion smearing times in the upper and 
lower halves of the band. Owing to the periodicity as- 
sumption of the discrete Fourier transform, the first n\ 
points in the result of the convolution operation will 
be erroneously mixed with data wrapped around from 
the end of the data segment, and the last nj points of 
the result will be erroneously mixed with time samples 
from t he beginning of the input segment (cf. Figure 
18.4 of IBracewelll [19 86) . Consequently, the polluted 
points (called the wrap-around region) from the result 
of each transformation step are discarded, a technique 
that forms part of the overlap-save method of discrete 
convolution. This method, depicted in Figure [3] also 
accounts for the fact that the duration of the impulse 
response function is generally much shorter than that 
of the signal to be convolved. 

Noting the quadratic form of Equation ©, it can 
be seen that the lowest frequency must be advanced 
by an amount greater than the delay applied to the 
highest frequency. This results in asymmetry of h(ti) 
around t = 0, which must be accounted when discard- 
ing wrap-around points from the result of each cyclical 
convolution operation. As na = +nj" points are dis- 
carded, the number of samples TV in the result of the 
Fast Fourier Transform (FFT) must be greater than 
nd- TV is chosen by minimising the number of floating 
point operations per second, 



flops = — 25v, 

N -n d 



(6) 



where Offt(TV) is the number of floating point opera- 
tions required to compute the FFT of TV complex input 
samples, which asymptotically approaches 5TV log 2 TV 
when TV is an integer power of two^. This equation 
is valid for both real-valued and complex- valued input 
data (see the Appendix for discussion). The factor of 2 
arises because both forward and backward transforms 
are computed. Equation ((6j) provides an estimate of 
the minimum number of flops required to process the 
data in real time. This number can be directly com- 
pared with FFT benchmarks, such as those computed 
by benchFFTQ. As discussed in more detail in Sec- 
tion [3] actual performance measurements can be used 
to better select the optimal FFT length. 

To perform the FFT, DSPSR can utilise both open 
source and proprietary FFT libraries, inclu ding the 




Figure 2: The magnitude of the dedispersion im- 
pulse response function, \h~ (tj)|, as calculated 
from the inverse FFT of the frequency response 
function, 7J _1 (i/fc), and normalized to unity at the 
origin (t = 0). For this example, z/o=600 MHz, 
(5^=20 MHz, DM=2, and TV=65536 points. Re- 
flecting the cyclicity of discrete convolution, time 
is labelled in both the positive (bottom axis) and 
negative (top axis) directions with respect to the 
origin, which recurs at the left and right bound- 
aries of the plot. At the edges of the response func- 
tion, where the magnitude drops to zero, the ring- 
ing is due to the cyclic discontinuity in i? _1 (t'fc) 
at v = 0. The smearing in the upper half of the 
band, = 750tts, and lower half of the band, 
= 788/is, result in asymmetry about the origin. 
This asymmetry also causes the observed slope in 
the magnitude of the impulse response; the flux at 
low frequency is smeared over more time than the 
flux at high frequency, but the integrated fluxes in 
the two halves of the band are equal. Note that 
and 47 correspond to m + and m_ in Figure 13.1.3 
of Numerical Recipes. 



Faste st Fourier Transform in the West (FFTW; lFrigo fc Johnson! 
120051 1 the Intel® Math Kernel LibrarjQ and Intel® 

3 http:/ /www. fftw.org/speed 

4 http: / /www. fftw.org/benchfft 

5 http: / / www.intel.com/software / products / mkl 
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Figure 3: The overlap-save method of discrete convolution. The input data (top) are divided into ./V-point 
segments that overlap by the number of non-zero points, rid, m the discrete phase dispersion impulse 
response function. Each TV-point segment is separately transformed into the Fourier domain, multiplied 
by the frequency response function, i? -1 ^), and transformed back into the time domain (middle). 
Polluted points from the wrap-around region (shaded grey) are discarded before copying the result into 
the output data stream (bottom). 



Integrated Performance Primitives On alloca- 

tion, all arrays are aligned on 16-byte boundaries so 
that vector instruction sets, such as Streaming SIMD 
Extensions (SSE), may be exploited. 

Invalid Data Excision: Referring to Figure [31 if 
any of the input voltage data in an Appoint segment 
are flagged as invalid, then the all of the N—rid samples 
of the output voltage data for that cyclical convolution 
step will be flagged. 



2.4 Synthetic Filterbank 

Though phase-coherent dispersion removal retains the 
original temporal resolution of the digitised signal, this 
resolution is often integrated away when forming the 
average pulse profile. It therefore proves useful to trade 
some time resolution for frequency resolution by divid- 
ing the observed band into a number of sub-bands, 
or channels. Both the smearing time, td, and the 
sampling rate, r, in each sub-band is reduced, result- 
ing in smaller n<j; this reduces the transform length, 
N, required for subsequent coherent dedispersion, and 
thereby improves computational efficiency. Further- 
more, with greater frequency resolution the measured 
polarisation may be better calibrated, or individual 
sub-bands corrupted by narrow-band radio-frequency 
interference (RFI) may be deleted. Finally, when fre- 
quency resolution is retained, a better estimate of the 
dispersion measure may be applied at a later time to 
correct the inter-channel dispersion delays before fur- 
ther integrating in frequency. Using DSPSR, the ob- 
served signal may be divided into N c frequency chan- 
nels using either the deprecated or convolving filter- 
bank, both of which are based on the FFT. 



6 http: / / www.intel.com/software / products/ipp 



2.4.1 Deprecated Filterbank 

In the deprecated filterbank, the data are divided into 
non-overlapping segments of N c complex points (or 
2N C real points). Each segment is transformed into the 
Fourier domain, where each of the N c complex spectral 
values is treated as a time sample in one of N c inde- 
pendent signals, each with bandwidth, 5v' = 8v/N c , 
and sampling rate, r' = r/N c . Phase-coherent disper- 
sion removal is then separately performed on each of 
the N c resulting channels using unique phase disper- 
sion frequency response functions, each tuned to the 
centre frequencies of the output filterbank channels. 

This method of synthetic filterbank formation is 
equivalent to the coheren t filter bank described in Sec- 
tion 3.1.3 of IJenet et al.l (|l997l i. Although it reduces 
the computational cost of coherent dedispersion (see 
the following section and the Appendix for further dis- 
cussion), the technique suffers from the spectral leak- 
age of the discrete Fourier transform, which mixes a 
significant amount of power between neighbouring fre- 
quency channels (NR, Section 13.4). When the inter- 
channel dispersion delay is large, the artifacts of spec- 
tral leakage are readily observed as delayed images of 
the pulse profile, as shown in Figure 3] 

Here, the dispersion delay between the 500 kHz 
channels of the deprecated filterbank (which ranges 
from ~ 88 to 154 /is across the band) is seen between 
the peaks (main and inter-pulse) and the leakage arti- 
facts on either side of them. The remaining artifacts in 
the APSR convolving filterbank are introduced by the 
Parkes Digital Filterbank (PDFB3), which is used to 
digitise and divide the 256 MHz band into 16 smaller 
16 MHz bands. The two-stage (1024-channel analysis 
followed by 16 x 64-channel synthesis) polyphase filter- 
bank implemented by the PDFB3 introduces spectral 
leakage between the 250 kHz channels produced by the 
analysis filterbank; the dispersion delay between these 
channels ranges from ~ 44 to 77 fj,s across the band. 
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Work is currently underway to correct this artifact us- 
ing a single-stage 16-channel analysis polyphase filter- 
bank. 

In contrast, CASPSR uses time-division demulti- 
plexing to implement parallel processing on multiple 
GPUs; therefore, spectral leakage is limited to that in- 
troduced by the forward 512 k-point FFT in the con- 
volving interbank used to process these data. At this 
frequency resolution, the dispersion delay between neigh- 
bouring channels varies between only ~ 0.11 and 0.27 
fis, which is less than the temporal resolution of the 
folded profile. 

2.4.2 Convolving Filterbank 

The spectral leakage function may be improved by the 
application of a window function in the time domain 
(NR, Fig. 13.4.2). However, the physical bandwidth of 
the spectral leakage function may also be decreased by 
increasing the length of the Fourier transform by a fac- 
tor, AT'. Where K — N' x N c , each Tf-point segment 
of data is transformed into the Fourier domain, and 
the result is divided into N c non-overlapping segments 
of N' spectral points. Each segment is inverse trans- 
formed back into the time domain, producing N' time 
samples from N c independent filterbank channels. 

Coherent dedispersion may be performed simulta- 
neously with this filterbank technique. While in the 
Fourier domain, each of the N c spectral segments is 
multiplied by a unique dedispersion frequency response 
function. In this case, N' must be chosen to match 
the size of the response function. As the same inverse 
transform size is applied to each sub-band, N' must be 
large enough to accommodate the maximum smearing 
time, t' d = D(u^ n — i^mlx), corresponding to the sub- 
band with the lowest centre frequency, where Vmin = 
— Sv/2, and = v m i n + 5v/N c . Owing to the 

quadratic form of Equation (JSJ, t' d > t d /N c . Although 
the smearing is less severe in those sub-bands with 
higher centre frequencies, symmetry dictates that n' d = 
r't' d > n d /Nc wrap-around points are discarded from 
the result of each of the N c inverse transforms (cycli- 
cal convolution products). Each A'-point segment of 
input data therefore overlaps by N c x n' d points. 

As shown in the Appendix, phase-coherent disper- 
sion removal using either the deprecated or convolving 
filterbank methods requires the same number of float- 
ing point operations per second, 

flops' = ^(lo g2 N ; + 2lo g2 N')^ (7) 

Note that the size of the first FFT, K = N c x N' > 
n d /N c , is inversely proportional to the number of fil- 
terbank channels, N c . As smaller transform lengths 
improve processing efficiency, N c should be chosen to 
be as large as possible without sacrificing required time 
resolution. This remains true only up to the limit 
where the resulting sampling interval, t' s = 1/r', be- 
comes larger than t' d (i.e. where n' d < 1). 

As shown in Figure 3J the convolving filterbank 
method eliminates the spectral leakage artifacts pro- 
duced using the deprecated method. Synthetic filter- 
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Figure 4: Correction of spectral leakage using 
the convolving filterbank. PSRB1937+21 was ob- 
served at a centre frequency of 1369 MHz using 
the 2 x 256 MHz band of APSR (top two panels) 
and at a centre frequency of 1382 MHz using the 
2 x 400 MHz band of CASPSR (bottom panel). 
The APSR data were processed using a synthetic 
filterbank as implemented by the deprecated (top 
panel) and convolving (middle panel) methods. 
The residual distortion in the APSR data (most 
obvious is the triangular base of the main pulse 
in the middle panel) is due to the Parkes Digi- 
tal Filterbank, as described in the text. The un- 
corrupted CASPSR observation best demonstrates 
the fidelity of the convolving filterbank. 
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bank formation with simultaneous convolution is sum- 
marised by the following algorithm: 

1. Divide the signal into if -point segments (K = 
N c x N') that overlap by N c x n' d points (cf. top 
of Figure [3J). 

2. For each segment, perform the 7f -point FFT, 
and divide the result into N c non-overlapping 
segments, or sub-bands, of N' points. 

3. For each of the iV c sub-bands: 

(a) multiply the iV'-point sub-band by its unique 
frequency response function; 

(b) perform an iV'-point inverse FFT; 

(c) discard the n' d wrap-around points; and 

(d) copy the remaining time samples into the 
corresponding output frequency channel. 

Invalid Data Excision: In addition to the dis- 
cussion of invalid data excision in Section llOl synthetic 
filterbank formation reduces the number of time sam- 
ples per weight by a factor of iV c . If N c is greater than 
the number of samples per weight, then the weights 
array will be rebinned with care to ensure that invalid 
data remain nagged. 

2.5 Detection 

If the voltage signals from both receptors of a dual- 
polarisation receiver are available, DSPSR computes the 
polarisation of the electromagnetic radiation. Radio 
pulsar polarimetry yields informa tion about the physics 
of th e emission mechanism (e.g. lEdwards fc Stappers! 
20041) . the geometry o f the pulsar's magnetosphere (e.g. 



Johnston et alj|2005f ). and the avera ge magneti c field 



of the ISM along the line of sight (e.g. lHan et all2 006). 
It also provides additional constraints tha t may be ex- 
ploite d for high-precision pulsar timing l|van Stratenl 
l2006h . 

Polarisation is described by the second-order statis- 
tics of the transverse electric field vector e, as given 
by the complex 2 x 2 coherency matrix, p = (ee'J 
(jBorn fc Wohl 119801. Here, the angular brackets de- 
note an ensemble average, e 1 " is the Hermitian trans- 
pose of e, and an outer product is implied by the 
treatment of e as a column vector. A pulse phase- 
resolved ensemble average is performed by the folding 
operation described in the next section. When detect- 
ing synthetic filterbank data, the coherency matrix is 
computed independently in each frequency channel. 

A useful geometric relationship between the com- 
plex two-dimensional space of the coherency matrix 
and the real four-dimensional space of the Stokes pa- 
rameters is expressed by the following pair of equa- 
tions: 



P 



Sk f fc/2 
tr(o- fc p). 



(8) 
(9) 



Here, Sk are the four Stokes parameters, Einstein no- 
tation is used to imply a sum over repeated indeces, 
< k < 3, (To is the 2x2 identity matrix, cri_3 are 
the Pauli matrices, and tr is the matrix trace operator. 



The Stokes four-vector is composed of the total inten- 
sity 5*o and the polarisation vector, S = (Si, S2, S3). 
Equation (|8j) expresses the coherency matrix as a linear 
combination of Hermitian basis matrices; equation ([9]| 
represents the Stokes parameters as the projections of 
the coherency matrix onto the basis matrices. 

The fourth-order moments of the electric field, as 
described by four-dimensional covariance matrix of the 
Stokes parameters, can optionally be formed and aver- 
aged as a function of pulse phase. The covariances of 
the Stokes parameters provide additional information 
about the pulsar radiation, such as the presence of or- 
thogonally polarised modes of emission and the degree 

of correlation between m ode intensities (IMcKinnon fc Stinebrina 
119981 : Ivan Stratenll2009T ). 

2.6 Folding 

Ensemble-averages of the detected signal are computed 
as a function of topocentric pulse phase during a pro- 
cess commonly called folding. Apparent topocentric 
pulse phase is determined using a polynomial approx- 
imation to the pulsar ti ming model, Vd,(i), gener ated 
using eit her the tempo (|Tavlor fc Weisber"3ll989l f1 or 
tempo2 l|Hobbs et al.ll2006l 1 a l software packages. The 
detected signal of interest, Y(ti), is folded by integrat- 
ing each of its samples into one of n equally spaced 
phase bins, Y((f>k), where bin number, k, is given by 



k = (riP^iti)) modn, 



(10) 



where V4, has dimensionless phase (turns) and the mod 
operator returns the remainder after division, such that 
< k < n. 

Typically, a vector of processes, Y(U) is folded in 
parallel with one count, N((/>k), of the number of time- 
samples accumulated in each phase bin. For example, 
Y(ti) may represent the four Stokes parameters in each 
of the N c frequency channels. The average pulse profile 
of each process is then given by 



[Y j (<t> k ))=Y j {cl> h )/N{4> k ) 



(11) 



The start and end times, to and tjy, of the observation 
are used to assign the mean time of the rising edge of 
phase bin zero. Under the assumption that V$ and its 
inverse may be calculated with sufficiently high accu- 
racy, this is given by 

To=T^ 1 {T 4 ,{(to + t N )/2)modl) (12) 

During the folding operation, DSPSR can divide the 
signal into segments with arbitrary start and end times 
and produce a unique set of pulse profiles for each seg- 
ment. This feature can be used to divide a long obser- 
vation into regular intervals, or sub-integrations, that 
may be later combined using an updated version of 
the pulsar timing model, or following the deletion of 
sub-integrations that have been corrupted by instru- 
mentation faults or radio frequency interference. 

The start and end times of each sub-integration can 
also be made to coincide (within the temporal resolu- 
tion of the signal) with integer values of pulse phase, 



7 http: / /pulsar. princeton.edu/tempo 

8 http: / / www. atnf. csiro. au / research / pulsar / tempo2 
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thereby producing single-pulse profiles. In single-pulse 
mode, it is usually necessary to remove the inter-channel 
dispersion delays, such that each single pulse integra- 
tion output by DSPSR contains the same pulse in all 
frequency channels. It may also be necessary to ad- 
just the reference phase of phase bin zero so that the 
on-pulse region of pulse phase is not divided across 
multiple integrations. 

For pulsars with short spin periods, such as mil- 
lisecond pulsars, it may not be feasible to store every 
single pulse with full frequency resolution. In such 
cases, it becomes desirable to either integrate the data 
down to some manageable size or add some conditions 
such that only a subset of the single pulse integrations 
are kept. To this e nd, dspsr employs the psrchive 
(jHotan et al.l [20 04b ) command language interpreted 
to run a user-supplied script, through which a diverse 
range of post-processing tasks can be executed. For ex- 
ample, when studying the polarisation of single pulses, 
the psrchive script can be used to calibrate before in- 
tegrating over frequency channels. When searching for 
giant pulses, the psrchive script can test if the single 
pulse contains any signal that is a certain threshold 
above the noise. 

Multiple pulsars can be folded simultaneously, which 
is particularly useful when observing globular cluster 
or double pulsar systems. When folding multiple pul- 
sars, the end of the signal path is forked across multiple 
instances of the folding operation. That is, the signal 
is processed identically for each pulsar up to the final 
stage of folding. Therefore, when performing phase- 
coherent dispersion removal on a set of globular clus- 
ter pulsars, the mean or median dispersion measure 
should be used. 

When parallel threads of execution are used, each 
thread folds its blocks of data into its own integration 
of the pulse profile. On completion of an integration, 
each thread places its result on a stack where the folded 
results from all threads await combination. When no 
threads have data to contribute to an integration, it is 
processed as in the single-threaded mode of operation. 

Invalid Data Excision: Any data that are flagged 
as invalid are omitted from integration. This feature 
can be disabled when studying bright pulsars and/or 
giant pulses. 



to satisfy the block size and overlap constraints of more 
than one such operation in the signal chain. Therefore, 
each DSPSR operation manages its own overlap require- 
ments using a thread-safe input buffering scheme. This 
design feature simultaneously increases both the mod- 
ularity and efficiency of the code by reducing interde- 
pendency between signal processing components and 
eliminating the need to re-process data lost by opera- 
tions that occur later in the signal path. 

Each operation that loses samples simply buffers 
the appropriate amount of data from the end of its 
current input data block to be later prefixed to the 
next input data block passed to the operation. A small 
amount of additional book-keeping ensures the conti- 
guity of consecutive input data blocks passed to each 
operation. When multiple processing threads are used, 
the input data blocks are distributed across multiple 
signal processing paths. Consequently, in each thread, 
the blocks passed on consecutive calls to an opera- 
tion are not necessarily contiguous. Therefore, for 
each component in the signal chain that requires it, 
the threads share a single input buffer through which 
the data to be prefixed are passed to the thread that 
receives the next contiguous input block. This method 
works only when the processing threads share the same 
physical memory. When threads are run on different 
computational devices, such as multiple graphics pro- 
cessing units (CPUs), and the cost of transmitting the 
buffered data between devices is high, then DSPSR will 
resort to the raw input data overlap strategy. 

3 Performance 

Digital signal processing is demanding of computa- 
tional, data communications, and data storage resources. 
Therefore, the performance of a baseband recording 
and processing system depends on a number of vari- 
ables, including the number of microprocessors, their 
architecture and clock speed, memory size and band- 
width, cache size and bandwidth, compiler version, 
and optimisation options. To demonstrate some of 
these basic considerations, a number of illustrative bench- 
marks are presented using currently available micro- 
processor technology. 



2.7 Thread-safe Input Buffering 

One or more of the operations in the signal processing 
chain may lose samples, such that the time-bandwidth 
product of the output of the operation is less than 
that of its input. For example, for each block of input 
data, coherent dedispersion discards at least nd sam- 
ples and will lose more if the input block size is not an 
integer number of overlapping TV-point segments. One 
possible solution is to set the raw input data block 
size and load overlapping blocks of raw input data ac- 
cordingly. When only one of the operations imposes 
a constraint on the raw input data block size and/or 
amount of raw input data block overlap, it is trivial to 
set these parameters; however, it may not be possible 



3.1 Central Processing Units 

For the most computationally intensive operations in 
DSPSR, such as coherent dedispersion and filterbank 
synthesis, execution time is dominated by computing 
the Fast Fourier Transform (FFT); therefore, the rate 
at which data are processed is a strong function of 
the performance of the FFT library. In Figure [5] the 
speeds of two commonly used FFT libraries are plotted 
as a function of transform length, JVfft, and the num- 
ber of operating threads. Similar to the benchFFT 
convention, performance in Gigaflops is defined as 



Gfiops : 



5iY" FF T l0g 2 (7VFFT) 



(13) 



9 http: / / psrchive.sourceforge.net / manuals / psrsh 



where t ns is the average time required to perform a 
single FFT in nanoseconds. These benchmarks were 
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Figure 5: Fast Fourier Transform performance as 
a function of transform length, TVfft- Speed is 
measured in effective Gigaflops per processing core 
as described in the text. Version 3.1.2 of the 
Fastest Fourier Transform in the West (FFTW) 
and version 5.3.3 of the Intel® Integrated Perfor- 
mance Primitives (IPP) libraries were used. Each 
panel displays the results of performing a single- 
precision, complex-to-complex FFT on each of one 
(plus), two (cross), four (square), and eight (circle) 
processing threads. Each thread performs a sep- 
arate FFT in isolation; therefore, the aggregate 
speed of the microprocessor is the value shown 
multiplied by the number of threads. 

performed on a workstation with dual Intel u Xeon® 
E5345 (Clovertown) processors, each with 4 cores run- 
ning at 2.33 GHz, 4 x 64 kB of level 1 (LI) cache, and 2 
x 4 MB of level 2 (L2) cache, connected via 10.66 GB/s 
bus to 16 GB of Random Access Memory (RAM). 

Figure [S] demonstrates the impact of LI and L2 
cache on FFT performance. FFT speed increases with 
greater FFT length until the 32 kB of LI cache re- 
served for data is filled (note that A?fft = 2 12 = 4k 
corresponds to 64 kB of memory required for input and 
output arrays). At this point the speed drops (most 
notably for FFTW) and remains constant before drop- 
ping again when the L2 cache is filled (-/Vfft = 2 18 = 
256 k corresponds to 4 MB of memory). As each of the 
8 processing cores is equipped with its own LI cache, 
there are no significant differences in FFT performance 
as a function of the number of operating threads for 
Nfft < 2 17 . However, because each L2 cache is shared 
by two processing cores, the effective L2 cache per 



core is halved when 8 processing threads are used; in 
this case, as seen in FigureJS] performance drops when 
iVpFT = 2 17 = 128k. 

Using the performance measurements plotted in 
Figure [5l DSPSR can select both the optimal trans- 
form length and the best FFT library to employ during 
phase-coherent dispersion removal and synthetic filter- 
bank formation. The selection is made by eliminating 
the Offt(N) ~ 5iVlog 2 7V assumption and using the 
measured time required to perform an TV-point FFT 
while minimising Equation ((6j) or Equation (J7J). Par- 
ticularly for shorter transform lengths, performance is 
significantly improved by this optimisation. 

Overall performance is evaluated by measuring the 
time required to completely process a block of data 
normalised by the real time spanned by that block of 
data (the real time is equal to the sampling interval 
times the number of samples in the block). Figure [6] 
plots the processing time to real time ratio as a func- 
tion of dispersion measure for four different instrumen- 
tal configurations. For each configuration, the pro- 
cessing time is that required to convert 8-bit digitised 
data to single-precision floating-point complex num- 
bers, form a niterbank with 500 kHz resolution while 
performing phase-coherent dispersion removal, detect 
the polarisation of the signal, and fold all frequency 
channels and polarisation parameters with a resolution 
of 1024 phase bins. 

At low DM in Figure El where the FFT length is 
smaller than the L2 cache, DSPSR performance scales 
roughly linearly with the number of processing threads. 
At the largest DM, 8 threads perform slightly worse 
than 4 threads owing to competition for L2 cache (on 
the Clovertown, each L2 cache is shared by two pro- 
cessing cores). To further illustrate the importance of 
cache size and memory bandwidth, the same bench- 
mark was repeated on a workstation with dual Intel® 
Xeon® E5520 (Nehalem-EP) processors, each with 4 
cores running at 2.26 GHz, 4 x 64 kB of LI cache, 4 x 
256 kB of mid-level (L2) cache, and 8MB of L3 cache, 
connected via 25.6 GB/s bus to 24 GB of RAM. The 
results are plotted in Figure[7| Performance at the low- 
est DM is not significantly improved, indicating that 
the problem is bound by the processor speed in this 
regime. At the highest DM, where the FFT no longer 
fits in L2 cache, data are processed as much as 4 times 
faster on the E5520, indicating that memory band- 
width must be the limiting factor on the E5345. 

The relatively modest bandwidths used for the per- 
formance benchmarks presented in Figures [S] and [7] are 
roughly an order of magnitude smaller than currently 
available with modern pulsar instrumentation. To pro- 
cess larger bandwidths in real time, the digitised sig- 
nal must be divided between multiple workstations, 
the number of which depends upon the method of di- 
vision. Using time-division demultiplexing, each ma- 
chine must process the full bandwidth for some fraction 
of the time. Either the frame size must be relatively 
large or the frames must overlap in time to compensate 
for the data lost to edge effects, which may necessitate 
the use of one or more intermediate workstations to 
buffer the frames. Using frequency-division demulti- 
plexing, each machine must process some fraction of 
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Figure 6: Processing time to real time ratio on 
dual Intel® Xeon® E5345 processors as a function 
of dispersion measure DM, centre frequency u, 
and bandwidth Sv. The benchmark was performed 
using one (top), four (middle), and eight (bottom) 
processing threads. Each panel plots four instru- 
mental configurations, including v — 3000 MHz 
with 8v = 64 MHz (plus), v = 1500MHz with 
6v = 32 MHz (cross), v = 750 MHz with 5v = 
16 MHz (square), and v = 375 MHz with Sv = 
8 MHz (circle). Processing time is the total time 
required to perform all steps in the typical pul- 
sar signal processing path described in the text. 
Fourier transforms were performed using the op- 
timal transform length and the best of either the 
FFTW or IPP libraries. 



R 
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Figure 7: Processing time to real time ratio on 
dual Intel® Xeon® E5520 processors, as in Fig- 
ure El 
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the bandwidth for the full time. To divide the sig- 
nal into frequency channels, some form of filterbank 
(such as a polyphase filterbank implemented on a field- 
programmable gate array) must be used. 

When designing a digital signal processing software- 
based instrument, the number of workstations required 
to process the desired bandwidth in real time may 
be estimated using a benchmark similar to that pre- 
sented in Figure [8] For example, considering the top 
panel (3 GHz), an instrument that uses time-division 
demultiplexing to process a 512 MHz band (circles) 
out to a DM of 1000 will require at least 11 worksta- 
tions. If the signal is first divided into four sub-bands 
(crosses), only 8 workstations (2 per sub-band) would 
be required. 



3.2 Graphics Processing Units 

On a multiprocessor device such as a graphics process- 
ing unit (GPU), the greatest performance is achieved 
when all of the processors are fully utilised. To max- 
imise utilisation, the NVIDIA® CUDA FFT library 
(CUFFT) enables parallel (batched) execution of mul- 
tiple transforms. This feature is exploited by the GPU- 
based implementation of the convolving filterbank al- 
gorithm to perform the N c inverse iV'-point FFTs in 
parallel (refer to the algorithm at the end of Section l2,4.2H . 
Consequently, FFT performance is a function of both 
N c and JVfft — N' , as shown in Figure [9] Here, per- 
formance in Gigaflops is defined as 

Gflops = 57V e jV F FT(log 2 iVc + 2 log 2 TVfft) ^ 

ins 

where the number of floating point operations is as 
derived in the Appendix and t ns is the average time 
in nanoseconds required to perform the forward and 
backward transforms in each step of the convolving fil- 
terbank algorithm. These benchmarks were performed 
on a workstation equipped with an NVIDIA® Fermi 
architecture Tesla C2050 GPU. In contrast to the bench- 
marks presented in Figure [SJ where memory band- 
width and cache size limits performance at large JVfft, 
CUFFT performs best when N c x JVfft is large be- 
cause greater microprocessor utilisation is achieved. 

Optimal GPU performance also requires minimis- 
ing the amount of data transferred between the de- 
vice and host memory. That is, once data are on the 
GPU, it is best to reduce them as much as practi- 
cable on the device. To this end, DSPSR implements 
GPU-based detection and folding algorithms, so that 
only the integrated results are transferred back to host 
memory. The folding algorithm exploits hardware- 
level multithreading to maximise utilisation by launch- 
ing a processing thread for each frequency channel, 
polarisation, and pulse phase bin, such that the total 
number of independent sums to be computed is much 
greater than the number of available cores (448 on the 
Fermi architecture). By scheduling many more threads 
than there are processors, the latency in thread exe- 
cution incurred during retrieval of data from off-chip 
memory is hidden by switching between active exe- 
cution contexts (or warps: atomic groups of threads 
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Figure 8: Processing time to real time ratio on 
dual Intel® Xeon® E5520 processors as a function 
of dispersion measure DAI . The centre frequency 
v varies from 375 MHz (bottom panel) to 3 GHz 
(top panel). In the bottom panel, the bandwidth 
8v varies as 8v — 8 MHz (triangle), 8v — 16 MHz 
(cross), 8v = 32 MHz (square), and 8v = 64 MHz 
(circle). Each symbol corresponds to the same rel- 
ative bandwidth 8vjv in each panel. Processing 
time is defined as in Figure |H1 
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10 15 
log 2 (N FFT ) 

Figure 9: Filtcrbank performance as a function 
of inverse transform length iVpFT and number of 
frequency channels N c . Speed is measured in ef- 
fective Gigaflops for a N c — 4 (plus), N c = 32 
(cross), N c = 256 (square), and N c = 2048 (circle) 
filterbank up to the limit of log 2 (-/V c 7VFFT) = 24. 
Version 3.0 of the NVIDIA® CUDA FFT library 
(CUFFT) was used. 



that execute simultaneously). Note that on the single- 
instruction, multiple-thread (SIMT) architecture em- 
ployed by streaming multiprocessors, separate regis- 
ters are allocated to each active thread; therefore, there 
is no cost associated with switching between active 
warps. If the maximum amount of data that fits into 
GPU memory corresponds to less than one pulse pe- 
riod and the number of frequency channels is small, 
then some fraction of the processors may be idle; there- 
fore, the current folding algorithm implementation per- 
forms best on pulsars with short spin periods. 

The overall performance of a contemporary GPU- 
based instrument can be estimated using the bench- 
marks presented in Figure 1101 Focusing once again 
on the top panel (3 GHz), to process a 512 MHz band 
out to a DM of 1000 in real time requires only 4 GPU- 
equipped workstations, regardless of time- or frequency- 
division demultiplexing. For reference, the spin period 
of the pulsar used in these benchmarks is ~ 5.76 ms. 

The software used to produce the benchmarks pre- 
sented in this section is distributed as part of the DSPSR 
package. To facilitate instrumental design and plan- 
ning, the latest available performance benchmarks are 
maintained onlinj^l. 
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4 Conclusion 



DSPSR is a high-performance, general-purpose tool for 
radio astronomical data reduction that has enabled a 
diverse range of experiments, includin g the discovery 
of th e giant micropulse phenomenon I Johnston et al.l 



2001 



, a survey for sub- millisecond p ulsars (lEdwards et al.l 
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2001 



, high-precision pu l sar timing (f var^Straten^et^fi 
Hotan et al. 2004a ;|Jacobv et al.ll2005l;IOrd et a! 



Figure 10: Processing time to real time ratio on 
an NVIDIA® Fermi architecture Tesla C2050 as a 
function of dispersion measure DM, as in FigurejH 



3 http: / /dspsr. sourceforge.net/bench 
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12006 JVerbjg^etalj 2008l ) , millisecond p ulsar polarime- 
try Ord et all 120041 ). giant pulses (e.g. iKnight et~aL 

2006) , and dispersion measure variations l|You et al. 

2007) , studies of g i ant pulses from the Crab Nebul a 
pulsar <|Bhat et all 120081 ; iKaruppusamv et all 120101 1. 
the first observations of rotating radio transient polar- 
isation dKarastergiou et~af 2009), t he search for gravi- 
tational waves IjVerbiest et al.ll2009l ; lYardlev et al.l2010i l. 
and an analysis of scattering by the interstellar plasma 
(jColes et all 120101 ). dspsr has also been used in the 
analysis of giant pulse data from the Mileura Wide- 
field Array Low F requency Demonstrator (MWA-LFD; 
iBhat et al.ll2007T l. and early pulsar observations made 
with the Low Frequency Array Core Station 1 (LOFAR 
CS1; B. Stappers 2009, private communication). 

Data recorded at most major radio observatories 
are understood by DSPSR, which can be used to process 
either voltage- or power-level signals (i.e. undetected 
or detected data). For example, DSPSR can process 
baseband data encoded using the VLBI Data Inter- 
face Format (VDIFj3 

as well as detected data stored 
using the psrfits file format. The library also con- 
tains routines to both read and write files in the for- 
mat used by SIGPROC, another popular signal process- 
ing package that is commonly used in the search for 
new pulsars, making DSPSR a useful component in sur- 
vey data reduction pipelines. Although most of the 
design and testing of the software has been completed 
for the analysis of radio pulsar observations, the large 
majority of the algorithms may be applied to any sin- 
gle beam (either single-dish or phased-array) radio ob- 
servation. For example, the DSPSR library could be 
utilised to produce high-resolution spectra for studies 
of large s cale structure in extragalac tic neutral Hydro- 
gen (e.g. IStavelev-Smith et "aDl200^ . 

Future developments of DSPSR may include the adop- 
tion of industry standards like the Open Computing 
Language (OpenCLj3f° r portable GPU programming. 
GPU performance may in turn facilitate the computa- 
tion of new statistical quantities. For example, the 
cyclostationary statistics of the electromagnetic field 
promises new insight into the phenomenon of diffrac- 
tive scintillation (P. Demorest 2008, private commu- 
nication) and may enable the inversion of multipath 
scattering effects th at arise in the interstellar medium 
(jWalker et al.l l2008). Although formation of the cyclic 
power spectrum is computationally prohibitive, it may 
be feasible to regularly observe this statistic using GPU 
technology. 

DSPSR provides the capacity for real-time phase- 
coherent dispersion removal over a comprehensive range 
of observing configurations, including those planned 
for the upcoming generation of radio telescopes such 
as the Au stralian Square Kilom eter Array Pathfinder 
(ASKAP; I Johnston et al.1 120081) a nd the Karoo Array 
Telescope fMeerKAT: ! Jonasl2 009) . The ASKAP Boolardy 
Engineering Test Array (BETA) will operate between 
700 MHz and 1.8 GHz and provide a 2 x 300 MHz band. 
The MeerKAT Precursor Array (MPA - also known as 
KAT-7) will provide a 2 x 256 MHz band within the 
frequency range from 1.2 to 1.95 GHz. Benchmarks 

1:L http: / /www. vlbi.org/vsi 
12 http:/ /www. khronos.org/opcncl 



demonstrate that DSPSR can comfortably process these 
signals in real time using only 4 workstations, each 
equipped with a single GPU. Considering its exem- 
plary performance, extensive functionality, and degree 
of maturity, both scientists and engineers are encour- 
aged to take advantage of this well-tested open-source 
digital signal processing library for radio astronomy. 
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A Floating Point Operations 

The number of floating point operations required to 
create a synthetic filterbank and perform phase-coherent 
dispersion removal is a function of the number of fre- 
quency channels produced by the synthetic filterbank, 
A c , and the size of the dispersion response function in 
the output frequency channel with the lowest centre 
frequency, N' . This is true for both real- valued and 
complex-valued input data. Although twice as many 
real-valued input time samples are required to obtain 
the same resolution in the frequency domain, the 2N- 
point real-to-complex Fast Fourier Transform (FFT) is 
readily computed with roughly the same efficiency as 
the Appoint complex-to-complex F FT, as described i n 
Section 12.3 of Numerical Recipes (jPress et al.lll992h . 

In the case of the deprecated filterbank, the for- 
ward A c -point FFT is repeated N' times, requiring 

N' x 5A C log 2 N c 

operations. Convolution is then performed indepen- 
dently on each of the N c channels using a forward and 
backward A'-point FFT, costing 

2A C x 5 AT' log 2 N' 

operations. In the case of the convolving filterbank, 
the single N C N' forward FFT requires 

5N C N' log 2 N C N' 

operations. For each of the N c channels, a backward 
A'-point FFT is performed, costing 

N c x 5N' log 2 N' 
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operations. Although both methods require a total of 

5iV c iV'(log 2 iV c + 21og 2 N') 

floating point operations, the speed with which those 
operations are performed will depend on the imple- 
mentation details of the FFT and the architecture of 
the computing device. As shown in Figure [S] FFT li- 
braries typically achieve greater efficiency as the trans- 
form size is increased until the problem no longer fits 
in LI cache. Therefore, if N c is smaller than the LI 
cache-limited transform length, then the convolving fil- 
terbank may provide better performance than the dep- 
recated method. However, if the product of N c x N' is 
large, such that the memory required to compute the 
FFT is larger than the L2 cache size, then the dep- 
recated method may provide better performance. Re- 
gardless of computational performance considerations, 
in light of the spectral leakage artifacts introduced by 
the deprecated filterbank method (see Figure |4j), use 
of the convolving filterbank is recommended. 
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